AITopics | confusion network

Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM

Prakash, Jeena, Kumar, Blessingh, Hacioglu, Kadri, Sharma, Bidisha, Gopalan, Sindhuja, Chetlur, Malolan, Venkatesan, Shankar, Stolcke, Andreas

arXiv.org Artificial Intelligence, Oct-6-2025

Automatic speech recognition (ASR) models rely on high-quality transcribed data for effective training. Generating pseudo-labels for large unlabeled audio datasets often relies on complex pipelines that combine multiple ASR outputs through multi-stage processing, leading to error propagation, information loss and disjoint optimization. We propose a unified multi-ASR prompt-driven framework using postprocessing by either textual or speech-based large language models (LLMs), replacing voting or other arbitration logic for reconciling the ensemble outputs. We perform a comparative study of multiple architectures with and without LLMs, showing significant improvements in transcription accuracy compared to traditional methods. Furthermore, we use the pseudo-labels generated by the various approaches to train semi-supervised ASR models for different datasets, again showing improved performance with textual and speechLLM transcriptions compared to baselines.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2025-1707

2506.11089

Country: Asia (0.46)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

SoftCTC -- Semi-Supervised Learning for Text Recognition using Soft Pseudo-Labels

Kišš, Martin, Hradiš, Michal, Beneš, Karel, Buchal, Petr, Kula, Michal

arXiv.org Artificial Intelligence, Sep-19-2023

This paper explores semi-supervised training for sequence tasks, such as Optical Character Recognition or Automatic Speech Recognition. We propose a novel loss function, SoftCTC, which is an extension of CTC that allows multiple transcription variants to be considered at the same time. This makes it possible to omit the confidence-based filtering step, which is otherwise a crucial component of pseudo-labeling approaches to semi-supervised learning. We demonstrate the effectiveness of our method on a challenging handwriting recognition task and conclude that SoftCTC matches the performance of a finely tuned filtering-based pipeline. We also evaluated SoftCTC in terms of computational efficiency, concluding that it is significantly more efficient than a naïve CTC-based approach for training on multiple transcription variants, and we make our GPU implementation public.

confusion network, probability, transcription, (14 more...)

arXiv.org Artificial Intelligence

2212.02135

Country:

Europe > Czechia > South Moravian Region > Brno (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
Asia > India > Karnataka > Bengaluru (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.91)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.86)
(3 more...)

Streaming Speech-to-Confusion Network Speech Recognition

Filimonov, Denis, Pandey, Prabhat, Rastrow, Ariya, Gandhe, Ankur, Stolcke, Andreas

arXiv.org Artificial Intelligence, Jun-2-2023

In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency, as needed for interactive applications. We show that 1-best results of our model are on par with a comparable RNN-T system, while the richer hypothesis set allows second-pass rescoring to achieve 10-20% lower word error rate on the LibriSpeech task. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
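As a rough illustration of why a confusion network helps second-pass rescoring, the sketch below represents a confusion network as a list of bins, takes the 1-best path, and then re-ranks all paths with an extra score. This is not the paper's method: the toy network, the `rescore` function, the `toy_lm` scorer, and the `weight` parameter are all hypothetical, and exhaustive path enumeration is only practical for tiny examples.

```python
import itertools
import math

# A confusion network as a list of bins; each bin maps word -> posterior.
# The words and probabilities are made up for illustration.
confnet = [
    {"recognize": 0.6, "wreck a nice": 0.4},
    {"speech": 0.55, "beach": 0.45},
]

def one_best(cn):
    """Pick the highest-posterior word in every bin."""
    return [max(b, key=b.get) for b in cn]

def rescore(cn, second_pass_logp, weight=1.0):
    """Enumerate every path and return the best one under a combined score.

    second_pass_logp is a hypothetical callable (for example a language
    model) that maps a word sequence to a log probability.
    """
    best_path, best_score = None, -math.inf
    for path in itertools.product(*(b.items() for b in cn)):
        words = [w for w, _ in path]
        first_pass = sum(math.log(p) for _, p in path)
        score = first_pass + weight * second_pass_logp(words)
        if score > best_score:
            best_path, best_score = words, score
    return best_path

def toy_lm(words):
    # Stand-in second-pass model that happens to prefer "beach".
    return 0.0 if "beach" in words else -1.0

first = one_best(confnet)         # ['recognize', 'speech']
second = rescore(confnet, toy_lm)  # ['recognize', 'beach']
```

The second pass can only recover words that survive in the bins, which is why a richer hypothesis set than a single 1-best string matters.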

artificial intelligence, hypothesis, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2306.03778

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)

Hystoc: Obtaining word confidences for fusion of end-to-end ASR systems

Beneš, Karel, Kocour, Martin, Burget, Lukáš

arXiv.org Artificial Intelligence, May-21-2023

End-to-end (e2e) systems have recently gained wide popularity in automatic speech recognition. However, these systems generally do not provide well-calibrated word-level confidences. In this paper, we propose Hystoc, a simple method for obtaining word-level confidences from hypothesis-level scores. Hystoc is an iterative alignment procedure which turns hypotheses from an n-best output of the ASR system into a confusion network. Eventually, word-level confidences are obtained as posterior probabilities in the individual bins of the confusion network. We show that Hystoc provides confidences that correlate well with the accuracy of the ASR hypothesis. Furthermore, we show that utilizing Hystoc in fusion of multiple e2e ASR systems increases the gains from the fusion by up to 1% WER absolute on the Spanish RTVE2020 dataset. Finally, we experiment with using Hystoc for direct fusion of n-best outputs from multiple systems, but we only achieve minor gains when fusing very similar systems.
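The step from hypothesis-level scores to word-level confidences can be sketched as follows. This is a minimal illustration, assuming the n-best hypotheses have already been aligned into equal-length slots; the iterative alignment that Hystoc performs is not reproduced here. The function name `word_confidences` and the example hypotheses and scores are hypothetical.

```python
import math
from collections import defaultdict

def word_confidences(aligned_hyps, log_scores):
    """Return one {word: posterior} dict per confusion-network bin.

    aligned_hyps: list of equal-length token lists (alignment assumed done).
    log_scores:   one hypothesis-level log score per hypothesis.
    """
    # A softmax over the hypothesis scores gives hypothesis posteriors.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    n_bins = len(aligned_hyps[0])
    bins = [defaultdict(float) for _ in range(n_bins)]
    for hyp, p in zip(aligned_hyps, posteriors):
        for slot, word in enumerate(hyp):
            # A word's confidence is the total posterior mass of the
            # hypotheses that place this word in this slot.
            bins[slot][word] += p
    return [dict(b) for b in bins]

hyps = [
    ["the", "cat", "sat"],
    ["the", "cat", "sad"],
    ["a", "cat", "sat"],
]
bins = word_confidences(hyps, [-1.0, -2.0, -3.0])
```

In this toy case every hypothesis agrees on "cat", so its confidence is 1, while "sat" and "sad" split the mass of the last bin in proportion to the scores of the hypotheses that contain them.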

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2305.12579

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Switzerland (0.04)
Europe > Italy (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Transformer-based encoder-encoder architecture for Spoken Term Detection

Švec, Jan, Šmídl, Luboš, Lehečka, Jan

arXiv.org Artificial Intelligence, Nov-2-2022

The paper presents a method for spoken term detection based on the Transformer architecture. We propose the encoder-encoder architecture employing two BERT-like encoders with additional modifications, including convolutional and upsampling layers, attention masking, and shared parameters. The encoders project a recognized hypothesis and a searched term into a shared embedding space, where the score of the putative hit is computed using the calibrated dot product. In the experiments, we used the Wav2Vec 2.0 speech recognizer, and the proposed system outperformed a baseline method based on deep LSTMs on the English and Czech STD datasets based on USC Shoah Foundation Visual History Archive (MALACH).

In this work, we do not focus on the direct processing of the input speech signal. Instead, we use the speech recognizer to convert an audio signal into a graphemic recognition hypothesis. The representation of speech at the grapheme level allows preprocessing the input audio into a compact confusion network and further to a sequence of embedding vectors. In [7], we proposed a Deep LSTM architecture for spoken term detection, which uses the projection of both the input speech and searched term into a shared embedding space. The hybrid DNN-HMM speech recognizer produced phoneme confusion networks representing the input speech. The DNN-HMM speech recognizer can be replaced with the Wav2Vec …

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2211.01089

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Czechia (0.04)
North America > United States > Washington > King County > Seattle (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings

Švec, Jan, Šmídl, Luboš, Psutka, Josef V., Pražák, Aleš

arXiv.org Artificial Intelligence, Oct-21-2022

The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work is based on the previous approach of using Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. The phoneme confusion network generated by a phoneme recognizer is processed by the deep LSTM network which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space using another deep LSTM network. The relevance score is then computed using a simple dot-product in the embedding space and calibrated using a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses on word and phoneme levels. The method is experimentally evaluated on MALACH data in English and Czech languages.
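The scoring step described above (dot product in a shared embedding space, sigmoid calibration, then localization from the sequence of probabilities) can be sketched roughly as below. The embeddings here are made-up numbers rather than LSTM outputs, `scale` and `bias` are hypothetical calibration parameters, and locating hits by thresholding consecutive segments is an assumption for illustration, not necessarily the procedure used in the paper.

```python
import math

def relevance_probabilities(segment_embeddings, term_embedding, scale=1.0, bias=0.0):
    """Score each segment by its dot product with the term, then apply a sigmoid.

    scale and bias are hypothetical calibration parameters.
    """
    probs = []
    for seg in segment_embeddings:
        score = sum(s * t for s, t in zip(seg, term_embedding))
        probs.append(1.0 / (1.0 + math.exp(-(scale * score + bias))))
    return probs

def locate(probs, threshold=0.5):
    """Return (start, end) index pairs of consecutive segments at or above threshold."""
    hits, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            hits.append((start, i - 1))
            start = None
    if start is not None:
        hits.append((start, len(probs) - 1))
    return hits

# Toy 2-dimensional embeddings standing in for the network outputs.
segments = [[-1.0, 0.0], [2.0, 1.0], [1.5, 0.5], [-2.0, -1.0]]
term = [1.0, 1.0]
probs = relevance_probabilities(segments, term)
hits = locate(probs)  # [(1, 2)]
```

Because the term embedding is computed once and compared with every segment by a dot product, scoring a new query over an archive reduces to cheap vector operations.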

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2021-1704

2210.11895

Country: Europe > Czechia (0.04)

Genre:

Research Report (1.00)
Overview > Innovation (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer

Švec, Jan, Lehečka, Jan, Šmídl, Luboš

arXiv.org Artificial Intelligence, Oct-21-2022

In recent years, standard hybrid DNN-HMM speech recognizers have been outperformed by end-to-end speech recognition systems. One of the very promising approaches is the grapheme Wav2Vec 2.0 model, which uses the self-supervised pretraining approach combined with transfer learning of the fine-tuned speech recognizer. Since it lacks the pronunciation vocabulary and language model, the approach is suitable for tasks where obtaining such models is not easy or almost impossible. In this paper, we use the Wav2Vec speech recognizer in the task of spoken term detection over a large set of spoken documents. The method employs a deep LSTM network which maps the recognized hypothesis and the searched term into a shared pronunciation embedding space in which the term occurrences and the assigned scores are easily computed. The paper describes a bootstrapping approach that allows the transfer of the knowledge contained in the traditional pronunciation vocabulary of DNN-HMM hybrid ASR into the context of grapheme-based Wav2Vec. The proposed method outperforms the previously published system based on the combination of the DNN-HMM hybrid ASR and phoneme recognizer by a large margin on the MALACH data in both English and Czech.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2022-10409

2210.11885

Country:

Europe > Czechia (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Understanding Domain Specific Languages (CS)

#artificialintelligence, Jul-6-2022, 08:40:35 GMT

Abstract: Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs.

application, domain corpora, ir task, (14 more...)

#artificialintelligence

Industry: Construction & Engineering (0.30)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.34)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.32)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

ConfNet2Seq: Full Length Answer Generation from Spoken Questions

Pal, Vaishali, Shrivastava, Manish, Besacier, Laurent

arXiv.org Artificial Intelligence, Jun-11-2020

Conversational and task-oriented dialogue systems aim to interact with the user using natural responses through multi-modal interfaces, such as text or speech. These desired responses are in the form of full-length natural answers generated over facts retrieved from a knowledge source. While the task of generating natural answers to questions from an answer span has been widely studied, there has been little research on natural sentence generation over spoken content. We propose a novel system to generate full-length natural language answers from spoken questions and factoid answers. The spoken sequence is compactly represented as a confusion network extracted from a pre-trained Automatic Speech Recognizer. To the best of our knowledge, this is the first attempt towards generating full-length natural answers from a graph input (confusion network). We release a large-scale dataset of 259,788 samples of spoken questions, their factoid answers and corresponding full-length textual answers. Following our proposed approach, we achieve comparable performance with the best ASR hypothesis.

machine learning, natural language, question answering, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-030-58323-1_56

2006.05163

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.05)
Asia > India > Telangana > Hyderabad (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)

Word-Error Correction of Continuous Speech Recognition Based on Normalized Relevance Distance

Fusayasu, Yohei (Kobe University) | Tanaka, Katsuyuki (Kobe University) | Takiguchi, Tetsuya (Kobe University) | Ariki, Yasuo (Kobe University)

AAAI Conferences, Jul-15-2015

In spite of the recent advancements being made in speech recognition, recognition errors are unavoidable in continuous speech recognition. In this paper, we focus on a word-error correction system for continuous speech recognition using confusion networks. Conventional N-gram correction is widely used; however, its performance degrades because the N-gram approach cannot measure information between long-distance words. In order to improve the performance of the N-gram model, we employ Normalized Relevance Distance (NRD) as a measure for semantic similarity between words. NRD can identify not only co-occurrence but also the correlation of importance of the terms in documents. Even if the words are located far from each other, NRD can estimate the semantic similarity between the words. The effectiveness of our method was evaluated in continuous speech recognition tasks for multiple test speakers. Experimental results show that our error-correction method is the most effective approach as compared to the methods using other features.

confusion network, correction, nrd, (13 more...)

AAAI Conferences

Twenty-Fourth International Joint Conference on Artificial Intelligence

Country: Asia > Japan (0.04)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Quality > Data Cleaning (0.88)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.87)
